Abstract. 

A formalism for describing the dynamics of Genetic Algorithms (GAs) using 
methods from statistical mechanics is applied to the problem of generalization in a 
perceptron with binary weights. The dynamics are solved for the case where a new 
batch of training patterns is presented to each population member each generation, 
which considerably simplifies the calculation. The theory is shown to agree closely 
with simulations of a real GA averaged over many runs, accurately predicting the mean 
best solution found. For weak selection and large problem size the difference equations 
describing the dynamics can be expressed analytically and we find that the effects of 
noise due to the finite size of each training batch can be removed by increasing the 
population size appropriately. If this population resizing is used, one can deduce the 
most computationally efficient size of training batch each generation. For independent 
patterns this choice also gives the minimum total number of training patterns used. 
Although using independent patterns is a very inefficient use of training patterns in 
general, this work may also prove useful for determining the optimum batch size in the 
case where patterns are recycled. 



1. Introduction 

Genetic Algorithms (GAs) are adaptive search techniques, which can be used to find 
low energy states in poorly characterized, high-dimensional energy landscapes [?, 12]. 
They have already been successfully applied in a large range of domains [?] and a review 
of the literature shows that they are becoming increasingly popular. In particular, GAs 
have been used in a number of machine learning applications, including the design and 
training of artificial neural networks [?, 20, 30]. 

In the simple GA considered here, each population member is represented by a 
genotype, in this case a binary string, and an objective function assigns an energy 
to each such genotype. A population of solutions evolves for a number of discrete 
generations under the action of genetic operators, in order to find low energy (high 
fitness) states. The most important operators are selection, where the population is 
improved through some form of preferential sampling, and crossover (or recombination), 
where population members are mixed, leading to non-local moves in the search space. 
Mutation is usually also included, allowing incremental changes to population members. 
GAs differ from other stochastic optimisation techniques, such as simulated annealing, 
because a population of solutions is processed in parallel and it is hoped that this 
may lead to improvement through the recombination of mutually useful features from 
different population members. 

A formalism has been developed by Prügel-Bennett, Shapiro and Rattray which 
describes the dynamics of a simple GA using methods from statistical mechanics [?, 
16, 17, 18]. This formalism has been successfully applied to a number of simple Ising 
systems and has been used to determine optimal settings for some of the GA search 
parameters [23]. It describes problems of realistic size and includes finite population 
effects, which have been shown to be crucial to understanding how the GA searches. 
The approach can be applied to a range of problems including ones with multiple optima, 
and it has been shown to predict simulation results with high accuracy, although small 
errors can sometimes be detected. 

Under the statistical mechanics formalism, the population is described by a small 
number of macroscopic quantities which are statistical measures of the population. 
Statistical mechanics techniques are used to derive deterministic difference equations 
which describe the average effect of each operator on these macroscopics. Since the 
dynamics of a GA is to be modelled by the average dynamics of an ensemble of GAs, 
it is important that the quantities which are used to describe the system are robust 
and self-averaging. The macroscopics which have been used are the cumulants of some 
appropriate quantity, such as the energy or the magnetization, and the mean correlation 
within the population, since these are robust statistics which average well over different 
realizations of the dynamics. There may be small systematic errors, since the difference 
equations for evolving these macroscopics sometimes involve nonlinear terms which may 
not self-average, but these corrections are generally small and will be neglected here. 

The statistical mechanics theory is distinguished by the facts that a macroscopic 
description of the GA is used and that the averaging is done such that fluctuations can 
be included in a systematic way. Many other theoretical approaches are based on the 
intuitive idea that above average fitness building blocks are preferentially sampled by 
the GA, which, if they can be usefully recombined, results in highly fit individuals being 
produced [?, ?]. Although this may be a useful guide to the suitability of particular 
problems to a GA, it is difficult to make progress towards a quantitative description for 
realistic problems, as it is difficult to determine which are the relevant building blocks 
and which building blocks are actually present in a finite population. This approach 
has led to false predictions of problem difficulty, especially when the dynamic nature of 
the search is ignored [?, 10]. A rigorous approach introduced by Vose et al. describes 
the population dynamics as a dynamical system in a high-dimensional Euclidean space, 
with each genetic operator incorporated as a transition tensor [?, 28]. This method 
uses a microscopic description and is difficult to apply to specific problems of realistic 
size due to the high dimensionality of the equations of motion. More recently, a number 
of results have been derived for the performance of a GA on a class of simple additive 
problems [?, 22]. These approaches use a macroscopic description, but assume a 
particular form for the distribution of macroscopics which is only applicable in large 
populations and for a specific class of problem. It is difficult to see how to transfer the 
results to other problems where finite population effects cannot be ignored. 

Other researchers have introduced theories based on averages. A description of 
GA dynamics in terms of the evolution of the parent distribution from which finite 
populations are sampled was produced by Vose and Wright [29]. This microscopic 
approach provides a description of the finite population effects which is elegant and 
correct. However, like other microscopic descriptions it is difficult to apply to specific 
realistic problems due to the enormous dimensionality of the system. Macroscopic 
descriptions can result in low-dimensional equations which can be more easily studied. 
Another formalism based on the evolution of parent distributions was developed by Peck 
and Dhawan [14], but they did not use the formalism to develop equations describing 
finite population dynamics. 

The importance of choosing appropriate quantities to average is well-known in 
statistical physics, but does not seem to be widely appreciated in genetic algorithm 
theory. In particular, many authors use results based on properties of the average 
probability distribution; this is insensitive to finite-population fluctuations and only 
gives accurate results in the infinite population limit. Thus, many results are only 
accurate in the infinite population limit, even though this limit is not taken explicitly. 
For example, Srinivas and Patnaik [25] and Peck and Dhawan [14] both produce 
equations for the moments of the fitness distribution in terms of the moments of the 
initial distribution. These are moments of the average distribution. Consequently, the 
equations do not correctly describe a finite population and results presented in these 
papers reflect that. Other attempts to describe GAs in terms of population moments (or 
schema moments or average Walsh coefficients) suffer from this problem. Macroscopic 
descriptions of population dynamics are also widely used in quantitative genetics (see, 
for example, reference [?]). In this field the importance of finite-population fluctuations 
is more widely appreciated; the infinite population limit is usually taken explicitly. Using 
the statistical mechanics approach, equations for fitness moments which include 
finite-population fluctuations can be derived by averaging the cumulants, which are more 
robust statistics. 

Here, the statistical mechanics formalism is applied to a simple problem from 
learning theory, generalization of a rule by a perceptron with binary weights. The 
perceptron learns from a set of training patterns produced by a teacher perceptron, also 
with binary weights. A new batch of training patterns is presented to each population 
member each generation, which simplifies the analysis considerably, since there are 
no over-training effects and each training pattern can be considered as statistically 
independent. Baum et al. have shown that this problem is similar to a paramagnet 
whose energy is corrupted by noise and they suggest that the GA may perform well 
in this case, since it is relatively robust towards noise when compared to local search 
methods [?]. The noise in the training energy is due to the finite size of the training set 
and is a feature of many machine learning problems [?]. 

We show that the noise in the training energy is well approximated by a Gaussian 
distribution for large problem size, whose mean and variance can be exactly determined 
and are simple functions of the overlap between pupil and teacher. This allows the 
dynamics to be solved, extending the statistical mechanics formalism to this simple, yet 
non-trivial, problem from learning theory. The theory is compared to simulations of a 
real GA averaged over many runs and is shown to agree well, accurately predicting the 
evolution of the cumulants of the overlap distribution within the population, as well as 
the mean correlation and mean best population member. In the limit of weak selection 
and large problem size the population size can be increased to remove finite training set 
effects and this leads to an expression for the optimal training batch size. 

2. Generalization in a perceptron with binary weights 

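A perceptron with Ising weights $w_i \in \{-1, 1\}$ maps an Ising training pattern 
$\{\zeta_i^\mu\}$ onto a binary output, 

$$ S^\mu = \mathrm{Sgn}\left( \sum_{i=1}^{N} w_i \zeta_i^\mu \right), \qquad \mathrm{Sgn}(x) = \begin{cases} 1 & \text{for } x > 0 \\ -1 & \text{for } x < 0 \end{cases} \tag{1} $$

where N is the number of weights. Let $t_i$ be the weights of the teacher perceptron and 
$w_i$ be the weights of the pupil. The stability of a pattern is a measure of how well it is 
stored by the perceptron and the stabilities of pattern $\mu$ for the teacher and pupil are 
$\Lambda_t^\mu$ and $\Lambda_w^\mu$, respectively, 

$$ \Lambda_t^\mu = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} t_i \zeta_i^\mu, \qquad \Lambda_w^\mu = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} w_i \zeta_i^\mu \tag{2} $$

The training energy will be defined as the number of patterns the pupil misclassifies, 

$$ E = \sum_{\mu=1}^{\lambda N} \Theta(-\Lambda_t^\mu \Lambda_w^\mu), \qquad \Theta(x) = \begin{cases} 1 & \text{for } x > 0 \\ 0 & \text{for } x < 0 \end{cases} \tag{3} $$

where $\lambda N$ is the number of training patterns presented and $\Theta(x)$ is the Heaviside 
function. In this work a new batch of training examples is presented each time the 
training energy is calculated. 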
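
As a minimal sketch of the energy evaluation just described, the following Python function draws a fresh batch of random Ising patterns and counts the patterns on which pupil and teacher stabilities have opposite signs. The function name `training_energy` and its arguments are illustrative choices, not taken from the original work, and N is assumed odd so that no stability is exactly zero.

```python
import numpy as np

def training_energy(w, t, n_patterns, rng):
    """Count patterns misclassified by pupil w relative to teacher t.

    A new batch of independent random Ising patterns is drawn on every
    call, as in the text. Assumes len(w) is odd, so no stability is zero.
    """
    N = w.size
    # Independent Ising patterns, one row per pattern.
    zeta = rng.choice([-1, 1], size=(n_patterns, N))
    # Stabilities of teacher and pupil for each pattern.
    stab_t = zeta @ t / np.sqrt(N)
    stab_w = zeta @ w / np.sqrt(N)
    # A pattern is misclassified when the two stabilities differ in sign.
    return int(np.sum(stab_t * stab_w < 0))
```

Because the batch is redrawn on each call, two evaluations of the same pupil generally return different energies, which is the noise analysed in the rest of this section.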

For large N it is possible to calculate the entropy of solutions compatible with the 
total training set and there is a first-order transition to perfect generalization as the 
size of training set is increased [11, 24]. This transition occurs for O(N) patterns and 
beyond the transition the weights of the teacher are the only weights compatible with 
the training set. In this case there is no problem with over-training to that particular 
set, although a search algorithm might still fail to find these weights. The GA considered 
here will typically require more than O(N) patterns, since it requires an independent 
batch for each energy evaluation, so avoiding any possibility of over-training. 

Define R to be the overlap between pupil and teacher, 

$$ R = \frac{1}{N} \sum_{i=1}^{N} w_i t_i \tag{4} $$

We choose $t_i = 1$ at every site without loss of generality. If a statistically independent 
pattern is presented to a perceptron, then for large N the stabilities of the teacher and 
pupil are Gaussian variables each with zero mean and unit variance, and with covariance 
R, 

$$ p(\Lambda_t, \Lambda_w) = \frac{1}{2\pi\sqrt{1 - R^2}} \exp\left( \frac{-(\Lambda_t^2 - 2R\Lambda_t\Lambda_w + \Lambda_w^2)}{2(1 - R^2)} \right) \tag{5} $$

The conditional probability distribution for the training energy given the overlap is, 

$$ p(E|R) = \left\langle \delta\Big( E - \sum_{\mu=1}^{\lambda N} \Theta(-\Lambda_t^\mu \Lambda_w^\mu) \Big) \right\rangle \tag{6} $$

where the brackets denote an average over stabilities distributed according to the joint 
distribution in equation (5). The logarithm of the Fourier transform generates the 
cumulants of the distribution and using the Fourier representation for the delta function 
in p(E|R) one finds, 

$$ \int_{-\infty}^{\infty} dE\, p(E|R)\, e^{tE} = \left\langle \prod_{\mu=1}^{\lambda N} \exp\left[ t\,\Theta(-\Lambda_t^\mu \Lambda_w^\mu) \right] \right\rangle = \left( 1 + \frac{1}{\pi}(e^{t} - 1)\cos^{-1}(R) \right)^{\lambda N} \tag{7} $$

The logarithm of this quantity can be expanded in t, with the cumulants of the 
distribution given by the coefficients of the expansion. The higher cumulants are $O(\lambda N)$ 
and it turns out that the shape of the distribution is not critical as long as $\lambda$ is O(1). 
A Gaussian distribution will be a good approximation in this case, 

$$ p(E|R) \simeq \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( \frac{-(E - \lambda N E_g(R))^2}{2\sigma^2} \right) \tag{8} $$

where the mean and variance are, 

$$ E_g(R) = \frac{1}{\pi}\cos^{-1}(R) \tag{9} $$

$$ \sigma^2 = \frac{\lambda N}{\pi}\cos^{-1}(R)\left( 1 - \frac{1}{\pi}\cos^{-1}(R) \right) \tag{10} $$

Here, $E_g(R)$ is the generalization error, which is the probability of misclassifying a 
randomly chosen training example. The variance expresses the fact that there is noise 
in the energy evaluation due to the finite size of the training batch. 
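
The first two cumulants can be checked with a short worked expansion. The sketch below assumes that the generating function in equation (7) takes the binomial form that follows when each of the $\lambda N$ patterns is misclassified independently with probability $E_g(R) = \cos^{-1}(R)/\pi$; the coefficient of $t^n/n!$ in its logarithm is then the nth cumulant of the training energy.

```latex
\log \int_{-\infty}^{\infty} dE\, p(E|R)\, e^{tE}
  = \lambda N \log\left[ 1 + E_g(R)\,(e^{t} - 1) \right]
  = \lambda N \left[ E_g(R)\, t
      + \frac{1}{2}\, E_g(R)\bigl( 1 - E_g(R) \bigr)\, t^{2}
      + O(t^{3}) \right]
```

Under this assumption the mean energy is $\lambda N E_g(R)$ and the variance is $\lambda N E_g(R)(1 - E_g(R))$, the mean and variance of a binomial count of misclassified patterns.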

3. Modelling the Genetic Algorithm 

3.1. The Genetic Algorithm 

Initially, a random population of solutions is created, in this case Ising weights of the 
form $\{w_1, w_2, \ldots, w_N\}$ where the alleles $w_i$ are the weights of a perceptron. The size 
of the population is P and will usually remain fixed, although a dynamical resizing of 
the population is discussed in section 7. Under selection, new population members are 
chosen from the present population with replacement, with a probability proportional 
to their Boltzmann weight. The selection strength $\beta$ is analogous to the inverse 
temperature and determines the intensity of selection, with larger $\beta$ leading to a 
higher variance of selection probabilities [?, 16]. Under standard uniform crossover, 
the population is divided into pairs at random and the new population is produced 
by swapping weights at each site within a pair with some fixed probability. Here, 
bit-simulated crossover is used, with new population members created by selecting 
weights at each site from any population member in the original population with equal 
probability [26]. In practice, the alleles at every site are completely shuffled within 
the population and this brings the population straight to the fixed point of standard 
crossover. This special form of crossover is only practicable here because crossover does 
not change the mean overlap between pupil and teacher within the population. Standard 
mutation is used, with random bits flipped throughout the population with probability 
$p_m$. 

Each population member receives an independent batch of $\lambda N$ examples from the 
teacher perceptron each generation, so that the relationship between the energy and the 
overlap between pupil and teacher is described by the conditional probability defined in 
equation (6). In total, $\lambda N \times P \times G$ training patterns are used, where G is the total number 
of generations and P is the population size (or the mean population size). 
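
One generation of this GA might be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used for the simulations: the function name `ga_generation`, the argument names and the operator order (selection, then crossover, then mutation) are choices made here, and all pattern batches are held in memory at once, which is only sensible for small P and N.

```python
import numpy as np

def ga_generation(pop, t, beta, lam, p_m, rng):
    """One generation: Boltzmann selection, bit-simulated crossover, mutation.

    pop is a (P, N) array of +/-1 weights and t is the teacher weight vector.
    The operator order is an assumption made for this sketch.
    """
    P, N = pop.shape
    n_pat = int(round(lam * N))
    # Each member receives its own independent batch of patterns.
    zeta = rng.choice([-1, 1], size=(P, n_pat, N))
    stab_t = zeta @ t                              # shape (P, n_pat)
    stab_w = np.einsum('pmn,pn->pm', zeta, pop)    # shape (P, n_pat)
    # Training energy: number of misclassified patterns per member.
    energy = np.sum(stab_t * stab_w < 0, axis=1)
    # Boltzmann selection with replacement (shifted for numerical stability).
    log_weight = -beta * energy
    prob = np.exp(log_weight - log_weight.max())
    prob /= prob.sum()
    pop = pop[rng.choice(P, size=P, p=prob)]
    # Bit-simulated crossover: shuffle the alleles at each site independently.
    pop = rng.permuted(pop, axis=0)
    # Mutation: flip each bit with probability p_m.
    flips = rng.random(pop.shape) < p_m
    return np.where(flips, -pop, pop)
```

Shuffling each column independently leaves the number of +1 alleles at every site unchanged, so the mean overlap with the teacher is untouched by the crossover step, as stated above.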



3.2. The Statistical Mechanics formalism 

The population will be described in terms of a number of macroscopic variables, the 
cumulants of the overlap distribution within the population and the mean correlation 
within the population. In the following sections, difference equations will be derived for 
the average change of a small set of these macroscopics, due to each operator. A more 
exact approach considers fluctuations from mean behaviour by modelling the evolution 
of an ensemble of populations described by a set of order parameters [?]. Here, it is 
assumed that the dynamics average sufficiently well so that we can describe the dynamics 
in terms of deterministic equations for the average behaviour of each macroscopic. This 
assumption is justified by the excellent agreement between the theory and simulations 
of a real GA, some of which are presented in section 8. Once difference equations are 
derived for each macroscopic, they can be iterated in sequence in order to simulate the 
full dynamics. 

Notice that although we follow information about the overlap between teacher and 
pupil, this is of course not known in general. The only feedback available when training 
the GA is the training energy defined in equation (3). Selection acts on this energy, and 
it is therefore necessary to average over the noise in selection which is due both to the 
stochastic nature of the training energy evaluation and of the selection procedure itself. 

Finite population effects prove to be of fundamental importance when modelling the 
GA. A striking example of this is in selection, where an infinite population assumption 
leads to the conclusion that the selection strength can be set arbitrarily high in order 
to move the population to the desired solution. This is clearly nonsense, as selection 
could never move the population beyond the best existing population member. Two 
improvements are required to model selection accurately; the population should be finite 
and the distribution from which it is drawn should be modelled in terms of more than 
two cumulants, going beyond a Gaussian approximation [16]. The higher cumulants play 
a particularly important role in selection which will be described in section 5 [17]. 

The higher cumulants of the population after bit-simulated crossover are determined 
by assuming the population is at maximum entropy with constraints on the mean overlap 
and correlation within the population (see Appendix A). The effect of mutation on the 
mean overlap and correlation only requires the knowledge of these two macroscopics, 
so these are the only quantities we need to evolve in order to model the full dynamics. 
All other relevant properties of the population after crossover can be found from the 
maximum entropy ansatz. A more general method is to follow the evolution of a number 
of cumulants explicitly, as in references [17, 18], but this is unnecessary here because of 
the special form of crossover used, which is not appropriate in problems with stronger 
spatial interactions. 

3.3. The cumulants and correlation 

The cumulants of the overlap distribution within the population are robust statistics 
which are often reasonably stable to fluctuations between runs of the GA, so that they 
average well [17]. The first two cumulants are the mean and variance respectively, while 
the higher cumulants describe the deviation from a Gaussian distribution. The third and 
fourth cumulants are related to the skewness and kurtosis of the population respectively. 
A population member, labelled $\alpha$, is associated with overlap $R_\alpha$ defined in equation (4). 
The cumulants of the overlap distribution within a finite population can be generated 
from the logarithm of a partition function, 

$$ Z = \sum_{\alpha=1}^{P} \exp(\gamma R_\alpha) \tag{11} $$

where P is the population size. If $\kappa_n$ is the nth cumulant, then, 

$$ \kappa_n = \lim_{\gamma \to 0} \frac{\partial^n}{\partial \gamma^n} \log Z \tag{12} $$

The partition function holds all the information required to determine the cumulants of 
the distribution of overlaps within the population. 
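
In practice the first four cumulants generated by this partition function are simply the cumulants of the empirical distribution of overlaps, and can be computed from central moments. The helper below is a sketch; the name `population_cumulants` is a hypothetical choice made here.

```python
import numpy as np

def population_cumulants(R):
    """First four cumulants of the overlaps R within one finite population.

    These are the derivatives of log Z with respect to gamma at gamma = 0,
    written in terms of central moments of the empirical distribution.
    """
    R = np.asarray(R, dtype=float)
    d = R - R.mean()
    m2 = np.mean(d ** 2)
    m3 = np.mean(d ** 3)
    m4 = np.mean(d ** 4)
    # kappa_4 is the fourth central moment minus three times the squared variance.
    return R.mean(), m2, m3, m4 - 3.0 * m2 ** 2
```

Note that the moments are normalized by P rather than P - 1, since the partition function describes the population itself and not an estimate of the distribution it was drawn from.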

The correlation within the population is a measure of the microscopic similarity of 
population members and is important because selection correlates a finite population, 
sometimes leading to premature convergence to poor solutions. It is also important 
in calculating the effect of crossover, since this involves the interaction of different 
population members and a higher correlation leads to less disruption on average. The 
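correlation between two population members, $\alpha$ and $\beta$, is $q_{\alpha\beta}$ and is defined by, 

$$ q_{\alpha\beta} = \frac{1}{N} \sum_{i=1}^{N} w_i^\alpha w_i^\beta \tag{13} $$

The mean correlation is q and is defined by, 

$$ q = \frac{2}{P(P-1)} \sum_{\alpha=1}^{P} \sum_{\beta > \alpha} q_{\alpha\beta} \tag{14} $$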

In order to model a finite population we consider that P population members are 
randomly sampled from an infinite population, which is described by a set of infinite 
population cumulants, $K_n$ [15]. The expectation values for the mean correlation and 
the first cumulant of a finite population are equal to the infinite population values. The 
higher cumulants are reduced by a factor which depends on the population size, 

$$ \kappa_1 = K_1 \tag{15a} $$
$$ \kappa_2 = P_2 K_2 \tag{15b} $$
$$ \kappa_3 = P_3 K_3 \tag{15c} $$
$$ \kappa_4 = P_4 K_4 - 6 P_2 (K_2)^2 / P \tag{15d} $$

Here, $P_2$, $P_3$ and $P_4$ give finite population corrections to the infinite population result 
(see reference [17] for a derivation), 

$$ P_2 = 1 - \frac{1}{P}, \qquad P_3 = 1 - \frac{3}{P} + \frac{2}{P^2}, \qquad P_4 = 1 - \frac{7}{P} + \frac{12}{P^2} - \frac{6}{P^3} \tag{16} $$

Although we model the evolution of a finite population, it is more natural to follow the 
macroscopics associated with the infinite population from which the finite population is 
sampled [15]. The expected cumulants of a finite population can be retrieved through 
equations (15a) to (15d). 
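
The finite population corrections are easy to apply numerically. The sketch below assumes the correction factors $P_2 = 1 - 1/P$, $P_3 = 1 - 3/P + 2/P^2$ and $P_4 = 1 - 7/P + 12/P^2 - 6/P^3$; the function name `finite_population_cumulants` is a hypothetical choice made here.

```python
def finite_population_cumulants(K1, K2, K3, K4, P):
    """Expected cumulants of P members sampled from an infinite population.

    K1..K4 are the infinite population cumulants. The correction factors
    P2, P3, P4 are the forms assumed in the accompanying text.
    """
    P2 = 1 - 1 / P
    P3 = 1 - 3 / P + 2 / P ** 2
    P4 = 1 - 7 / P + 12 / P ** 2 - 6 / P ** 3
    k1 = K1
    k2 = P2 * K2
    k3 = P3 * K3
    k4 = P4 * K4 - 6 * P2 * K2 ** 2 / P
    return k1, k2, k3, k4
```

For example, sampling P = 2 members from a Gaussian parent halves the expected variance and produces a negative expected fourth cumulant, even though the parent distribution has none.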

4. Crossover and mutation 

The mean effects of standard crossover and mutation on the distribution of overlaps 
within the population are equivalent to the paramagnet results given in [17]. However, 
bit-simulated crossover brings the population straight to the fixed point of standard 
crossover, which will be assumed to be a maximum entropy distribution with the correct 
mean overlap and correlation, as described in Appendix A. To model this form of 
crossover one only requires knowledge of these two macroscopics, so these are the only 
two quantities we need to evolve under selection and mutation. 

The mean overlap and correlation after averaging over all mutations are, 

$$ K_1^m = (1 - 2p_m) K_1 \tag{17a} $$
$$ q_m = (1 - 2p_m)^2 q \tag{17b} $$



where $p_m$ is the probability of flipping a bit under mutation [17]. The higher cumulants 
after crossover are required to determine the effects of selection, discussed in the next 
section. The mean overlap and correlation are unchanged by crossover and the other 
cumulants can be determined by noting that bit-simulated crossover completely removes 
the difference between site averages within and between different population members. 
For example, terms like $\langle w_i^\alpha w_j^\alpha \rangle_{i \neq j}$ and $\langle w_i^\alpha w_j^\beta \rangle_{\alpha \neq \beta}$ are equal on average. After cancelling 
terms of this form one finds that the first four cumulants of an infinite population after 
crossover are, 

$$ K_1^c = K_1 \tag{18a} $$
$$ K_2^c = \frac{1}{N}(1 - q) \tag{18b} $$
$$ K_3^c = -\frac{2}{N^2}\left( K_1 - \frac{1}{N}\sum_{i=1}^{N} \langle w_i \rangle^3 \right) \tag{18c} $$
$$ K_4^c = -\frac{2}{N^3}\left( 1 - 4q + \frac{3}{N}\sum_{i=1}^{N} \langle w_i \rangle^4 \right) \tag{18d} $$

Here, the brackets denote population averages. The third and fourth order terms in 
the expressions for the third and fourth cumulants are calculated in Appendix A by 
making a maximum entropy ansatz. The expected cumulants of a finite population 
after crossover are determined from equations (15a) to (15d). 
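
Given the mean weight at each site, the cumulants after bit-simulated crossover follow from treating the sites as independent. The sketch below takes the site means as an input array (in the theory they come from the maximum entropy ansatz) and assumes the cumulant expressions $K_2^c = (1-q)/N$, $K_3^c = -2(K_1 - \sum_i \langle w_i \rangle^3 / N)/N^2$ and $K_4^c = -2(1 - 4q + 3\sum_i \langle w_i \rangle^4 / N)/N^3$; the name `crossover_cumulants` is hypothetical.

```python
import numpy as np

def crossover_cumulants(site_means):
    """Infinite population cumulants of the overlap after bit-simulated crossover.

    site_means[i] is the population average of the weight at site i.
    Sites are treated as independent, so cumulants add across sites.
    """
    m = np.asarray(site_means, dtype=float)
    N = m.size
    K1 = m.mean()                  # mean overlap (teacher weights all +1)
    q = np.mean(m ** 2)            # mean correlation at the crossover fixed point
    K2 = (1 - q) / N
    K3 = -2 / N ** 2 * (K1 - np.mean(m ** 3))
    K4 = -2 / N ** 3 * (1 - 4 * q + 3 * np.mean(m ** 4))
    return K1, K2, K3, K4
```

For a completely random population (all site means zero) this reduces to the cumulants of a scaled sum of N fair coin flips.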



5. The cumulants after selection 

Under selection, P new population members are chosen from the present population 
with replacement. Following Prügel-Bennett we split this operation into two stages [15]. 
First we randomly sample P population members from an infinite population in order 
to create a finite population. Then an infinite population is generated from this finite 
population by selection. The proportion of each population member represented in the 
infinite population after selection is equal to its probability of being selected, which 
is defined below. The sampling procedure can be averaged out in order to calculate 
the expectation values for the cumulants of the overlap distribution within an infinite 
population after selection, in terms of the infinite population cumulants before selection. 

The probability of selecting population member $\alpha$ is $p_\alpha$ and for Boltzmann selection 
one chooses, 

$$ p_\alpha = \frac{e^{-\beta E_\alpha}}{\sum_{\alpha'=1}^{P} e^{-\beta E_{\alpha'}}} \tag{19} $$

where $\beta$ is the selection strength and the denominator ensures that the probability is 
correctly normalized. Here, $E_\alpha$ is the training energy of population member $\alpha$. 
One can then define a partition function for selection, 

$$ Z_s = \sum_{\alpha=1}^{P} \exp(-\beta E_\alpha + \gamma R_\alpha) \tag{20} $$

The logarithm of this quantity generates the cumulants of the overlap distribution for 
an infinite population after selection, 

$$ K_n^s = \lim_{\gamma \to 0} \frac{\partial^n}{\partial \gamma^n} \log Z_s \tag{21} $$

One can average this quantity over the population by assuming each population member 
is independently selected from an infinite population with the correct cumulants, 

$$ \langle \log Z_s \rangle = \left( \prod_{\alpha=1}^{P} \int dR_\alpha\, dE_\alpha\, p(R_\alpha)\, p(E_\alpha|R_\alpha) \right) \log Z_s \tag{22} $$

where p(E|R) determines the stochastic relationship between energy and overlap as 
defined in equation (6), which will be approximated by the Gaussian distribution in 
equation (8). Following Prügel-Bennett and Shapiro one can use Derrida's trick and 
express the logarithm as an integral in order to decouple the average [?, 16], 

$$ \langle \log Z_s \rangle = \int_0^\infty dt\, \frac{e^{-t} - \langle e^{-tZ_s} \rangle}{t} = \int_0^\infty dt\, \frac{e^{-t} - f^P(t, \beta, \gamma)}{t} \tag{23} $$

where, 

$$ f(t, \beta, \gamma) = \int dR\, dE\, p(R)\, p(E|R) \exp\left( -t e^{-\beta E + \gamma R} \right) \tag{24} $$
The distribution of overlaps within an infinite population is approximated by a cumulant 
expansion around a Gaussian distribution [?], 

$$ p(R) = \frac{1}{\sqrt{2\pi K_2}} \exp\left( \frac{-(R - K_1)^2}{2K_2} \right) \left[ 1 + \sum_{n \ge 3} \frac{K_n}{K_2^{n/2}}\, u_n\!\left( \frac{R - K_1}{\sqrt{K_2}} \right) \right] \tag{25} $$

where $u_n(x) = (-1)^n e^{x^2/2} \frac{d^n}{dx^n} e^{-x^2/2} / n!$ are scaled Hermite polynomials. Four cumulants 
were used for the simulations presented in section 8 and the third and fourth Hermite 
polynomials are $u_3(x) = (x^3 - 3x)/3!$ and $u_4(x) = (x^4 - 6x^2 + 3)/4!$. This function 
is not a well defined probability distribution since it is not necessarily positive, but it 
has the correct cumulants and provides a good approximation. In general, the integrals 
in equations (23) and (24) have to be computed numerically, as was the case for the 
simulations presented in section 8. 
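
The four-cumulant expansion can be evaluated directly, which is the first step in computing the integrals numerically. The sketch below assumes the expansion takes the standard form of a Gaussian multiplied by $1 + (K_3/K_2^{3/2})\,u_3(x) + (K_4/K_2^{2})\,u_4(x)$ with $x = (R - K_1)/\sqrt{K_2}$; the function name `overlap_density` is a hypothetical choice.

```python
import numpy as np

def overlap_density(R, K1, K2, K3=0.0, K4=0.0):
    """Four-cumulant expansion of the overlap distribution around a Gaussian.

    Not guaranteed to be positive everywhere, as noted in the text.
    """
    x = (R - K1) / np.sqrt(K2)
    gauss = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi * K2)
    u3 = (x ** 3 - 3 * x) / 6              # scaled Hermite polynomial u_3
    u4 = (x ** 4 - 6 * x ** 2 + 3) / 24    # scaled Hermite polynomial u_4
    return gauss * (1 + K3 / K2 ** 1.5 * u3 + K4 / K2 ** 2 * u4)
```

Since the Hermite terms integrate to zero against the Gaussian weight, the function remains normalized and keeps mean $K_1$ for any choice of the higher cumulants.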

5.1. Weak selection and large N 

It is instructive to expand in small $\beta$ and large N, as this shows the contributions for each 
cumulant explicitly and gives some insight into how the size of the training set affects 
the dynamics. Since the variance of the population is O(1/N) it is reasonable to expand 
the mean of p(E|R), defined in equation (8), around the mean of the population in this 
limit ($R \simeq K_1$). It is also assumed that the variance of p(E|R) is well approximated 
by its leading term and this assumption may break down if the gradient of the noise 
becomes important. Under these simplifying assumptions one finds, 

$$ E_g(R) \simeq \frac{1}{\pi}\left( \cos^{-1}(K_1) - \frac{R - K_1}{\sqrt{1 - K_1^2}} \right) \tag{26} $$

$$ \sigma^2 \simeq \frac{\lambda N}{\pi}\cos^{-1}(K_1)\left( 1 - \frac{1}{\pi}\cos^{-1}(K_1) \right) \tag{27} $$
Following Prügel-Bennett and Shapiro one can expand the integrand in 
equation (23) for small $\beta$ (as long as $\lambda$ is at least O(1) so that the variance of p(E|R) 
is O(N)), 

$$ f^P(t, \beta, \gamma) \simeq \exp\left( -tP\rho_1(\beta, \gamma) \right) \left( 1 + \frac{Pt^2}{2}\left( \rho_2(\beta, \gamma) - \rho_1^2(\beta, \gamma) \right) \right) \tag{28} $$

where, 

$$ \rho_n(\beta, \gamma) = \int dR\, dE\, p(R)\, p(E|R)\, e^{n(-\beta E + \gamma R)} \tag{29} $$

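We approximate p(E|R) by a Gaussian whose mean and variance are given in equations (26) 
and (27). Completing the integral in equation (23), one finds an expression for the 
cumulants of an infinite population after selection, 

$$ K_n^s = \lim_{\gamma \to 0} \frac{\partial^n}{\partial \gamma^n} \left[ \log\left( P\rho_1(k\beta, \gamma) \right) - \frac{e^{(\beta\sigma)^2}}{2P} \left( \frac{\rho_2(k\beta, \gamma)}{\rho_1^2(k\beta, \gamma)} \right) \right] \tag{30} $$

where, 

$$ \rho_n(k\beta, \gamma) = \int dR\, p(R)\, e^{nR(k\beta + \gamma)} = \exp\left( \sum_{i=1}^{\infty} \frac{n^i (k\beta + \gamma)^i K_i}{i!} \right) \tag{31} $$

Here, a cumulant expansion has been used. The parameter k is the constant of 
proportionality relating the generalization error to the overlap in equation (26) (constant 
terms are irrelevant, as Boltzmann selection is invariant under the addition of a constant 
to the energy), 

$$ k = \frac{\lambda N}{\pi\sqrt{1 - K_1^2}} \tag{32} $$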

For the first few cumulants of an infinite population after selection one finds, 

$$ K_1^s = K_1 + \left( 1 - \frac{e^{(\beta\sigma)^2}}{P} \right) k\beta K_2 + O(\beta^2) \tag{33a} $$
$$ K_2^s = \left( 1 - \frac{e^{(\beta\sigma)^2}}{P} \right) K_2 + \left( 1 - \frac{3e^{(\beta\sigma)^2}}{P} \right) k\beta K_3 + O(\beta^2) \tag{33b} $$
$$ K_3^s = \left( 1 - \frac{3e^{(\beta\sigma)^2}}{P} \right) K_3 - \frac{6e^{(\beta\sigma)^2}}{P}\, k\beta K_2^2 + O(\beta^2) \tag{33c} $$

The expected cumulants of a finite population after selection are retrieved through 
equations (15a) to (15d). For the zero noise case ($\sigma = 0$) this is equivalent to selecting 
directly on overlaps (with energy $-R$), with selection strength $k\beta$. We will therefore call 
$k\beta$ the effective selection strength. It has previously been shown that this parameter 
should be scaled inversely with the standard deviation of the population in order to 
make continued progress under selection, without converging too quickly [17]. Strictly 
speaking, we can only use information about the distribution of energies since the 
overlaps will not be known in general, but to first order in $R - K_1$ this is equivalent 
to scaling the selection strength inversely to the standard deviation of the energy 
distribution. As in the problems considered in reference [?], the finite population effects 
lead to a reduced variance and an increase in the magnitude of the third cumulant, 
related to the skewness of the population. This leads to an accelerated reduction in 
variance under further selection. The noise due to the finite training set increases the 
size of the finite population effects. The other genetic operators, especially crossover, 
reduce the magnitude of the higher cumulants to allow further progress under selection. 
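
The weak selection update can be written as a short function, which makes the role of the noise explicit. The sketch below assumes the first-order forms in which every finite population correction 1/P is multiplied by $e^{(\beta\sigma)^2}$, as written in the code comments; the function name `weak_selection_step` and its argument names are hypothetical.

```python
import math

def weak_selection_step(K1, K2, K3, beta, k, sigma, P):
    """First-order (weak selection) update of the first three cumulants.

    Assumed forms, to first order in beta:
      K1s = K1 + (1 - c) k beta K2
      K2s = (1 - c) K2 + (1 - 3c) k beta K3
      K3s = (1 - 3c) K3 - 6 c k beta K2^2
    with c = exp((beta sigma)^2) / P.
    """
    c = math.exp((beta * sigma) ** 2) / P
    K1s = K1 + (1 - c) * k * beta * K2
    K2s = (1 - c) * K2 + (1 - 3 * c) * k * beta * K3
    K3s = (1 - 3 * c) * K3 - 6 * c * k * beta * K2 ** 2
    return K1s, K2s, K3s
```

Under these assumed forms the noise enters only through the combination $e^{(\beta\sigma)^2}/P$, so a noisy population of size $P e^{(\beta\sigma)^2}$ evolves like a noise-free population of size P. This is the sense in which finite training batch effects can be removed by resizing the population.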



6. The correlation after selection 



To model the full dynamics, it is necessary to evolve the mean correlation within 
the population under selection. This is rather tricky, as it requires knowledge of the 
relationship between overlaps and correlations within the population. To make the 
problem tractable, it is assumed that before selection the population is at maximum 
entropy with constraints on the mean overlap and correlation within the population, as 
discussed in Appendix A. The calculation presented here is similar to that presented 
elsewhere [18], except for a minor refinement which seems to be important when 
considering problems with noise under selection. 

The correlation of an infinite population after selection from a finite population is 
given by, 

$$ q_s = \sum_{\alpha=1}^{P} p_\alpha^2 (1 - q_{\alpha\alpha}) + \sum_{\alpha=1}^{P} \sum_{\beta=1}^{P} p_\alpha p_\beta q_{\alpha\beta} = \Delta q_d + q_\infty \tag{34} $$

where $p_\alpha$ is the probability of selection, defined in equation (19). The first term is due 
to the duplication of population members under selection, while the second term is due 
to the natural increase in correlation as the population moves into a region of lower 
entropy. The second term gives the increase in the correlation in the infinite population 
limit, where the duplication term becomes negligible. An extra set of variables $q_{\alpha\alpha}$ are 
assumed to come from the same statistics as the distribution of correlations within the 
population. Recall that the expectation value for the correlation of a finite population 
is equal to the correlation of the infinite parent population from which it is sampled. 



6.1. Natural increase term 

We estimate the conditional probability distribution for correlations given overlaps 
before selection p(q a p\R ai Rp) by assuming the weights within the population are 
distributed according to the maximum entropy distribution described in |Appendix A . 
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Then is simply the correlation averaged over this distribution and the distribution 
of overlaps after selection, p B (R). 

<?oo = [dq a p dR a dR p p s (R a )p s (Rp)p(q a p\R cn R p ) q aP (35) 

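This integral can be calculated for large N by the saddle point method and we find that
in this limit the result only depends on the mean overlap after selection (see Appendix B),

$$q_\infty(y) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{W_i + \tanh(y)}{1 + W_i\tanh(y)}\right)^2 \qquad (36)$$

where,

$$K_1^s = \frac{1}{N}\sum_{i=1}^{N}\frac{W_i + \tanh(y)}{1 + W_i\tanh(y)} \qquad (37)$$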

The natural increase contribution to the correlation $q_\infty$ is an implicit function of $K_1^s$
through y, which is related to $K_1^s$ by equation (37). Here, $W_i$ is the mean weight at site
i before selection (recall that we have chosen the teacher's weights to be $t_i = 1$ at every
site, without loss of generality) and for a distribution at maximum entropy one has,

$$W_i = \tanh(z + x\eta_i) \qquad (38)$$

The Lagrange multipliers, z and x, are chosen to enforce constraints on the mean overlap
and correlation within the population before selection and $\eta_i$ is drawn from a Gaussian
distribution with zero mean and unit variance (see Appendix A).



It is instructive to expand in y, which is appropriate in the weak selection limit. In
this case one finds,

$$K_1^s = K_1^c + y\,(N K_2^c) + \frac{y^2}{2}\,(N^2 K_3^c) + \cdots \qquad (39)$$

$$q_\infty(y) = q - y\,(N^2 K_3^c) - \frac{y^2}{2}\,(N^3 K_4^c) + \cdots \qquad (40)$$

where $K_n^c$ are the infinite population expressions for the cumulants after bit-simulated
crossover, when the population is assumed to be at maximum entropy (defined in
equations (18a) to (18c) up to the fourth cumulant). Here, y plays the role of the
effective selection strength in the associated infinite population problem, so for an
infinite population one could simply set $y = k\beta/N$, where k is defined in equation (32).
To calculate the correlation after selection, we solve equation (37) for y and then
substitute this value into equation (36) to calculate $q_\infty$. In general this must be done
numerically, although the weak selection expansion can be used to obtain an analytical
result which gives a very good approximation in many cases. Notice that the third
cumulant in equation (40) will be negative for $K_1 > 0$ because of the negative entropy
gradient and this will accelerate the increased correlation under selection.
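
The procedure described in this section can be sketched numerically. The Python fragment below is only an illustration under stated assumptions: sites are treated as independent with mean weights $W_i = \tanh(z + x\eta_i)$, the relation between y and the mean overlap after selection is inverted by bisection, and the exact natural increase term is compared with its expansion to second order in y. The function names, the parameter values and the use of site-wise cumulants of ±1 variables for $N^2K_3$ and $N^3K_4$ are choices made for this sketch and do not come from the paper.

```python
import math
import random


def site_means(z, x, n_sites, seed=0):
    """Maximum entropy mean weights W_i = tanh(z + x*eta_i), eta_i ~ N(0, 1)."""
    rng = random.Random(seed)
    return [math.tanh(z + x * rng.gauss(0.0, 1.0)) for _ in range(n_sites)]


def shifted(w, y):
    """Mean weight at one site after selection of effective strength y."""
    t = math.tanh(y)
    return (w + t) / (1.0 + w * t)


def mean_overlap_after(ws, y):
    """Mean overlap after selection: site average of the shifted mean weights."""
    return sum(shifted(w, y) for w in ws) / len(ws)


def q_inf(ws, y):
    """Natural increase term: site average of the squared shifted mean weights."""
    return sum(shifted(w, y) ** 2 for w in ws) / len(ws)


def solve_y(ws, k1_target, lo=-5.0, hi=5.0, iters=60):
    """Bisection for y; mean_overlap_after is increasing in y."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_overlap_after(ws, mid) < k1_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def q_inf_weak(ws, y):
    """Expansion to second order in y, assuming independent +-1 sites."""
    n = len(ws)
    q = sum(w * w for w in ws) / n
    n2k3 = -sum(2 * w * (1 - w * w) for w in ws) / n                # N^2 K_3
    n3k4 = -sum(2 * (1 - w * w) * (1 - 3 * w * w) for w in ws) / n  # N^3 K_4
    return q - y * n2k3 - 0.5 * y * y * n3k4


# Illustrative values only (hypothetical z, x and target shift in the mean).
ws = site_means(z=0.3, x=0.2, n_sites=200)
k1 = mean_overlap_after(ws, 0.0)
y = solve_y(ws, k1 + 0.02)
print(y, q_inf(ws, y), q_inf_weak(ws, y))
```

For a small shift in the mean overlap the two estimates of the correlation should be nearly indistinguishable, which is the sense in which the weak selection expansion is a good approximation.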

6.2. Duplication term

The duplication term $\Delta q_d$ is defined in equation (34). As in the partition function
calculation presented in section 5, population members are independently averaged over
a distribution with the correct cumulants,

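$$\Delta q_d = P\left(\prod_{\alpha=1}^{P}\int dR_\alpha\, dE_\alpha\, dq_{\alpha\alpha}\; p(R_\alpha)\, p(E_\alpha|R_\alpha)\, p(q_{\alpha\alpha}|R_\alpha, R_\alpha)\right)\frac{(1 - q_{\alpha\alpha})\, e^{-2\beta E_\alpha}}{\left(\sum_{\gamma} e^{-\beta E_\gamma}\right)^2}$$

$$\phantom{\Delta q_d} = P\left(\prod_{\alpha=1}^{P}\int dR_\alpha\, dE_\alpha\, dq_{\alpha\alpha}\; p(R_\alpha)\, p(E_\alpha|R_\alpha)\, p(q_{\alpha\alpha}|R_\alpha, R_\alpha)\right)(1 - q_{\alpha\alpha})\exp(-2\beta E_\alpha)\int_0^\infty dt\; t\,\exp\left(-t\sum_{\gamma} e^{-\beta E_\gamma}\right) \qquad (41)$$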

Here, $q_{\alpha\alpha}$ is a construct which comes from the same statistics as the correlations between
distinct population members. The integral in t removes the square in the denominator
and decouples the average,

$$\Delta q_d = P\int_0^\infty dt\; t\, f(t)\, g^{P-1}(t) \qquad (42)$$



where,

$$f(t) = \int dR\, dE\, dq\; p(R)\, p(E|R)\, p(q|R,R)\,(1 - q)\exp\left(-2\beta E - t e^{-\beta E}\right) \qquad (43)$$

$$g(t) = \int dR\, dE\; p(R)\, p(E|R)\exp\left(-t e^{-\beta E}\right) \qquad (44)$$

The overlap distribution p(R) will be approximated by the cumulant expansion in
equation (25) and p(q|R,R) by the distribution derived in Appendix B. In general,
it would be necessary to calculate these integrals numerically, but the correlation
distribution is difficult to deal with as it requires the numerical reversion of a saddle
point equation.

Instead, we expand for small $\beta$ and large N as we did for the selection calculation
in section 5.1 (this approximation is only used for the term involving the correlation in
equation (42) for the simulations presented in section 8). In this case one finds,



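$$f(t)\, g^{P-1}(t) \simeq \rho(2\beta)\exp\left[-t\left((P-1)\rho(\beta) + \frac{\rho(3\beta)}{\rho(2\beta)}\right)\right] - \rho_q(2\beta)\exp\left[-t\left((P-1)\rho(\beta) + \frac{\rho_q(3\beta)}{\rho_q(2\beta)}\right)\right] \qquad (45)$$

where,

$$\rho(\beta) = \int dR\, dE\; p(R)\, p(E|R)\, e^{-\beta E} \qquad (46)$$

$$\rho_q(\beta) = \int dR\, dE\; p(R)\, p(E|R)\int dq\; p(q|R,R)\, q\, e^{-\beta E} \qquad (47)$$

Completing the integral in equation (42) one finds,

$$\Delta q_d = \frac{\rho(2\beta) - \rho_q(2\beta)}{P\rho^2(\beta)} + O\left(\frac{1}{P^2}\right) \qquad (48)$$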
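
The leading-order form of the duplication term can be checked by simulation in a deliberately simplified setting. The Python sketch below assumes an uncorrelated population (so the factor involving q is dropped) whose energies are pure Gaussian noise with standard deviation `sigma`; in that case $\rho(b) = \exp[(b\sigma)^2/2]$ and the leading term $\rho(2\beta)/(P\rho^2(\beta))$ reduces to $\exp[(\beta\sigma)^2]/P$. The function names and parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def duplication_mc(pop_size, beta, sigma, trials=2000, seed=1):
    """Monte Carlo estimate of sum_a p_a^2 for Boltzmann selection weights
    p_a = exp(-beta*E_a) / sum_b exp(-beta*E_b), with E_a ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        w = [math.exp(-beta * rng.gauss(0.0, sigma)) for _ in range(pop_size)]
        norm = sum(w)
        total += sum(v * v for v in w) / (norm * norm)
    return total / trials


def duplication_leading(pop_size, beta, sigma):
    """Leading-order estimate rho(2*beta) / (P * rho(beta)^2) for Gaussian
    energies, where rho(b) = exp((b*sigma)^2 / 2)."""
    def rho(b):
        return math.exp(0.5 * (b * sigma) ** 2)
    return rho(2 * beta) / (pop_size * rho(beta) ** 2)


# Illustrative values only.
P, beta, sigma = 50, 1.0, 0.3
mc = duplication_mc(P, beta, sigma)
lead = duplication_leading(P, beta, sigma)
print(mc, lead, math.exp((beta * sigma) ** 2) / P)
```

The simulated and leading-order values should agree up to corrections of relative order 1/P, and the last printed number shows that noise enters only through the factor $\exp[(\beta\sigma)^2]$ in this simplified case.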
We express $\rho_q(\beta)$ in terms of the Fourier transform of the distribution of correlations,
which is defined in equation (B15),

$$\rho_q(\beta) = \lim_{t\to 0}\frac{\partial}{\partial t}\log\left(\int dR\, dE\; p(R)\, p(E|R)\, p(-it|R,R)\, e^{-\beta E}\right)\rho(\beta) \qquad (49)$$

The integrals can be calculated by expressing p(E|R) by the same approximate form as
in section 5.1 and using the saddle point method to integrate over the Fourier transform
as in Appendix B.



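Eventually one finds,

$$\Delta q_d = \frac{e^{(\beta\sigma)^2}}{P}\left[1 - q_\infty(2k\beta/N)\right]\frac{\rho_2(k\beta, 0)}{\rho_1^2(k\beta, 0)} + O\left(\frac{1}{P^2}\right) \qquad (50)$$

where $q_\infty(y)$ is defined in equation (36) and $\rho_n(k\beta, \gamma)$ is defined in equation (31).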

It is instructive to expand in $\beta$ as this shows the contributions from each cumulant
explicitly. To do this we use the cumulant expansion described in equation (25) and to
third order in $\beta$ for three cumulants one finds,

$$\Delta q_d \simeq \frac{e^{(\beta\sigma)^2}}{P}\left[1 - q_\infty(2k\beta/N)\right]\left(1 + K_2 (k\beta)^2 - K_3 (k\beta)^3 + O(\beta^4)\right) \qquad (51)$$

The $q_\infty$ term has not been expanded out since it contributes terms of O(1/N) less than
these contributions for each cumulant. Selection leads to a negative third cumulant
(see equation (33c)), which in turn leads to an accelerated increase in correlation under
further selection. Crossover reduces this effect by reducing the magnitude of the higher
cumulants.

7. Dynamic population resizing 

The noise introduced by the finite sized training set increases the magnitude of the
detrimental finite population terms in selection. In the limit of weak selection and large
problem size discussed in sections 5.1 and 6.2 this can be compensated for by increasing
the population size. The terms which involve noise in equations (33) and (50) can be
removed by an appropriate population resizing,

$$P = P_0 \exp\left[(\beta\sigma)^2\right] \qquad (52)$$

Here, $P_0$ is the population size in the infinite training set, zero noise limit. Since these
are the only terms in the expressions describing the dynamics which involve the finite
population size, this effectively maps the full dynamics onto the infinite training set
case.

For zero noise the selection strength should be scaled so that the effective selection
strength $k\beta$ is inversely proportional to the standard deviation of the population [18],

$$k\beta = \frac{\beta_s}{\sqrt{\kappa_2}} \qquad (53)$$

Here, k is defined in equation (32) and $\beta_s$ is the scaled selection strength, which
remains fixed throughout the search. Recall that $\kappa_2$ is the expected variance of a
finite population, which is related to the variance of an infinite population through
equation (15b). One could also include a factor of $\sqrt{\log P}$ to compensate for changes
in population size, as in reference [17], but this term is neglected here. The resized
population is then,



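$$P = P_0 \exp\left(\frac{\beta_s^2\sigma^2}{k^2\kappa_2}\right) = P_0 \exp\left(\frac{\beta_s^2\,(1-\kappa_1^2)\cos^{-1}(\kappa_1)\,(\pi - \cos^{-1}(\kappa_1))}{\lambda N \kappa_2}\right) \qquad (54)$$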



Notice that the exponent in this expression is O(1), so this population resizing does not
blow up with increasing problem size. One might therefore expect this problem to scale
with N in the same manner as the zero-noise, infinite training set case, as long as the
batch size is O(N).

Baum et al have shown that a closely related GA scales as 0{N\o^N) on this 
problem if the population size is sufficiently large so that alleles can be assumed to come 
from a binomial distribution 0. This is effectively a maximum entropy assumption with 
a constraint on the mean overlap alone. They use culling selection, where the best half 
of the population survives each generation leading to a change in the mean overlap 
proportional to the population's standard deviation. Our selection scaling also leads to 
a change in the mean of this order and the algorithms may therefore be expected to 
compare closely. The expressions derived here do not rely on a large population size 
and are therefore more general. 

In the infinite population limit it is reasonable to assume $N\kappa_2 = 1 - \kappa_1^2$, which is the
relationship between mean and variance for a binomial distribution, since in this limit
the correlation of the population will not increase due to duplication under selection.
In this case the above scaling results in a monotonic decrease in population size, as $\kappa_1$
increases over time. This is easy to implement by removing the appropriate number of
population members before each selection.

In a finite population the population becomes correlated under selection and the 
variance of the population is usually less than the value predicted by a binomial 
distribution. In this case the population size may have to be increased, which could 
be implemented by producing a larger population after selection or crossover. This is 
problematic, however, since increasing the population size leads to an increase in the 
correlation and a corresponding reduced performance. In this case the dynamics will no 
longer be equivalent to the infinite training set situation. 

Instead of varying the population size, one can fix the population size and vary the
size of the training batches. In this case one finds,

$$\lambda = \frac{\beta_s^2\,(1-\kappa_1^2)\cos^{-1}(\kappa_1)\,(\pi - \cos^{-1}(\kappa_1))}{N\kappa_2\log(P/P_0)} \qquad (55)$$

Figure 1 shows how choosing the batch size each generation according to
equation (55) leads to the dynamics converging onto the infinite training set dynamics
where the training energy is equal to the generalization error. The infinite training
set result for the largest population size is also shown, as this gives some measure of
the potential variability of trajectories available under different batch sizing schemes.
Any deviation from the weak selection, large N limit is not apparent here. To a
good approximation it seems that the population resizing in equation (54) and the
corresponding batch sizing expression in equation (55) are accurate, at least as long as
$\lambda$ is not too small.
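
The two sizing rules of this section are inverses of one another, which the short Python sketch below makes explicit. It assumes the reading $P = P_0\exp(c/\lambda)$ for the resized population and $\lambda = c/\log(P/P_0)$ for the batch size, with $c = \beta_s^2(1-\kappa_1^2)\cos^{-1}(\kappa_1)(\pi-\cos^{-1}(\kappa_1))/(N\kappa_2)$. The helper names (`noise_scale`, `resized_population`, `batch_size`) are hypothetical, and the cumulant values in the example are illustrative (the binomial value $N\kappa_2 = 1-\kappa_1^2$ is used purely for convenience).

```python
import math


def noise_scale(k1, k2, n, beta_s):
    """c = beta_s^2 (1 - k1^2) acos(k1) (pi - acos(k1)) / (n * k2), where k1
    and k2 are the mean and variance of the overlap within the population."""
    a = math.acos(k1)
    return beta_s ** 2 * (1.0 - k1 ** 2) * a * (math.pi - a) / (n * k2)


def resized_population(p0, lam, k1, k2, n, beta_s):
    """Population size that compensates for batch size lam: P = P0 exp(c/lam)."""
    return p0 * math.exp(noise_scale(k1, k2, n, beta_s) / lam)


def batch_size(p, p0, k1, k2, n, beta_s):
    """Batch size that compensates for a fixed population size P > P0."""
    if p <= p0:
        raise ValueError("P must exceed P0 for a finite batch size")
    return noise_scale(k1, k2, n, beta_s) / math.log(p / p0)


# Illustrative values: N, beta_s, P0 and P as quoted for figure 1,
# with a hypothetical mean overlap and the binomial variance.
n, beta_s, p0, p = 279, 0.25, 60, 163
k1 = 0.5
k2 = (1.0 - k1 ** 2) / n
lam = batch_size(p, p0, k1, k2, n, beta_s)
print(lam, resized_population(p0, lam, k1, k2, n, beta_s))
```

In a simulation the cumulants change each generation, so either function would be re-evaluated once per generation with the current values of $\kappa_1$ and $\kappa_2$.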




Figure 1. The mean overlap between teacher and pupil within the population is shown
each generation for a GA training a binary perceptron to generalize from examples
produced by a teacher perceptron. The results were averaged over 100 runs and
training batch sizes were chosen according to equation (55), leading to the trajectories
converging onto the infinite training set result where $E = E_g(R)$. The solid curve is for
the infinite training set with $P_0 = 60$ and the finite training set results are for P = 90
(□), 120 (○) and 163 (△). Inset is the mean choice of $\lambda$ each generation. The dashed
line is the infinite training set result for P = 163, showing that there is significant
potential variability of trajectories under different batch sizing schemes. The other
parameters were N = 279, $\beta_s = 0.25$ and $p_m = 0.001$.

7.1. Optimal batch size

In the previous section it was shown how the population size could be changed to 
remove the effects of noise associated with a finite training set. If we use this population 
resizing then it is possible to define an optimal size of training set, in order to minimize 
the computational cost of energy evaluation. This choice will also minimize the total 
number of training examples presented when independent batches are used. This may 
be expected to provide a useful estimate of the appropriate sizing of batches in more 
efficient schemes, where examples are recycled, as long as the total number of examples 
used significantly exceeds the threshold above which over-training is impossible. 

We assume that computation is mainly due to energy evaluation and note that there
are P energy evaluations each generation with computation time for each scaling as $\lambda$.
If the population size each generation is chosen by equation (54), then the computation
time $\tau_c$ (in arbitrary units) is given by,

$$\tau_c = \lambda\exp\left(\frac{\lambda_0}{\lambda}\right), \qquad \lambda_0 = \frac{\beta_s^2\,(1-\kappa_1^2)\cos^{-1}(\kappa_1)\,(\pi - \cos^{-1}(\kappa_1))}{N\kappa_2} \qquad (56)$$

The optimal choice of $\lambda$ is given by the minimum of $\tau_c$, which is at $\lambda_0$. Choosing this
batch size leads to the population size being constant over the whole GA run and for
optimal performance one should choose,

$$P = P_0 e^{1} \simeq 2.72 P_0 \qquad (57)$$
$$\lambda = \lambda_0 \qquad (58)$$

where $P_0$ is the population size used for the zero noise, infinite training set GA. Notice
that it is not necessary to determine $P_0$ in order to choose the size of each batch, since
$\lambda$ is not a function of $P_0$. Since the batch size can now be determined automatically,
this reduces the size of the GA's parameter space significantly.
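
The location of the minimum can be confirmed with a one-line calculation: for a cost of the form $\tau_c = \lambda\exp(\lambda_0/\lambda)$ the derivative is $\exp(\lambda_0/\lambda)(1 - \lambda_0/\lambda)$, which vanishes at $\lambda = \lambda_0$, where the resized population is $P_0 e$. The Python sketch below checks this by a simple grid search; the value of `lam0` and the grid are arbitrary illustrative choices.

```python
import math


def cost(lam, lam0):
    """Computation time per generation in arbitrary units: lam * exp(lam0/lam)."""
    return lam * math.exp(lam0 / lam)


lam0 = 0.4                                   # hypothetical value of lambda_0
grid = [0.01 * i for i in range(1, 501)]     # candidate batch sizes 0.01 .. 5.00
best = min(grid, key=lambda lam: cost(lam, lam0))
ratio = math.exp(lam0 / best)                # P / P0 at the optimum
print(best, ratio)
```

The grid minimum sits at the grid point closest to `lam0`, and the corresponding ratio $P/P_0$ is close to e.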

One of the runs in figure 1 is for this choice of P and $\lambda$, showing close agreement with
the infinite training set dynamics ($P = 163 \simeq P_0 e$). In general, the first two cumulants
change in a non-trivial manner each generation and their evolution can be determined
by simulating the dynamics, as described in section 8.



8. Simulating the dynamics 

In sections 4, 5 and 6, difference equations were derived for the mean effect of each
operator on the mean overlap and correlation within the population. The full dynamics 
of the GA can be simulated by iterating these equations starting from their initial 
values, which are zero. The equations for selection also require knowledge of the higher 
cumulants before selection, which are calculated by assuming a maximum entropy 
distribution with constraints on the two known macroscopics (see equations (18a)
to (18c)). We used four cumulants and the selection expressions were calculated
numerically, although for weak selection the analytical results in section 5.1 were also
found to be very accurate. The largest overlap within the population was estimated
by assuming population members were randomly selected from a distribution with the
correct cumulants [17]. This assumption breaks down towards the end of the search,
when the population is highly correlated and the higher cumulants become large, so 
that four cumulants may not describe the population sufficiently well. 

Figures 2 and 3 show the mean, variance and largest overlap within the population
each generation, averaged over 1000 runs of a GA and compared to the theory. The
infinite training set case, where the training energy is the generalization error, is
compared to results for two values of $\lambda$, showing how performance degrades as the batch
size is reduced. Recall that $\lambda N$ new patterns are shown to each population member,
each generation, so that the total number of patterns used is $\lambda N \times PG$, where P is
population size and G is the total number of generations. The skewness and kurtosis
are presented in figure 4 for one value of $\lambda$, showing that although there are larger
fluctuations in the higher cumulants they seem to agree sufficiently well with the theory
on average. It would probably be possible to model the dynamics accurately with only
three cumulants, since the kurtosis does not seem to be particularly significant in these
simulations.




Figure 2. The theory is compared to averaged results from a GA training a binary
perceptron to generalize from examples produced by a teacher perceptron. The mean
and variance of the overlap distribution within the population are shown, averaged
over 1000 runs, with the solid lines showing the theoretical predictions. The infinite
training set result (○) is compared to results for a finite training set with $\lambda = 0.65$ (□)
and $\lambda = 0.39$ (△). The other parameters were N = 155, $\beta_s = 0.3$, $p_m = 0.005$ and the
population size was 80.

Figure 3. The maximum overlap between teacher and pupil is shown each generation,
averaged over the same runs as the results presented in figure 2. The solid lines show
the theoretical predictions and the symbols are as in figure 2.

Figure 4. The skewness and kurtosis of the overlap distribution are shown averaged
over the same runs as the results presented in figure 2 for $\lambda = 0.65$. Averages were
taken over cumulants, rather than the ratios shown. The solid lines show the theoretical
predictions for mean behaviour.

These results show excellent agreement with the theory, although there is a slight
underestimate in the best population member for the reasons discussed above. This is
typical of the theory, which has to be very accurate in order to pick up the subtle effects
of noise due to the finite batch size. Unfortunately, the agreement is less accurate for low
values of $\lambda$, where the noise is stronger. This may be due to two simplifications. Firstly,
we use a Gaussian approximation for the noise which relies on $\lambda$ being at least O(1). This
could be remedied by expanding the noise in terms of more than two cumulants as we
have done for the overlap distribution. Secondly, the duplication term in section 6.2 uses
the large N, weak selection approximation which also relies on $\lambda$ being O(1). The error
due to this approximation is minimized by only using the approximation for the term
involving the correlation in equation (42), with the other term calculated numerically.
It is expected that good results for smaller values of $\lambda$ would be possible for larger values
of N, where the correlation calculation would be more exact.



9. Conclusion 

A statistical mechanics formalism has been used to solve the dynamics of a GA for a 
simple problem from learning theory, generalization in a perceptron with binary weights. 
To make the dynamics tractable, the case where a new batch of examples was presented 
to each population member each generation was considered. For O(N) training examples
per batch the training energy was well approximated by a Gaussian distribution whose
mean is the generalization error and whose variance increases as the batch size is reduced.
The use of bit-simulated crossover, which takes the population straight to the fixed
point of standard crossover, allowed the dynamics to be modelled in terms of only
two macroscopics: the mean correlation and overlap within the population. The higher
cumulants of the overlap distribution after crossover were required to calculate the effect 
of selection and were estimated by assuming maximum entropy with respect to the two 
known macroscopics. By iterating difference equations describing the average effect 
of each operator on the mean correlation and overlap the dynamics of the GA were 
simulated, showing very close agreement with averaged results from a GA. 

Although the difference equations describing the effect of each operator required 
numerical enumeration in some cases, analytical results were derived for the weak 
selection, large N limit. It was shown that in this limit a dynamical resizing of the 
population maps the finite training set dynamics onto the infinite training set situation. 
Using this resizing it is possible to calculate the most computationally efficient size 
of population and training batch, since there is a diminishing return in improved 
performance as batch size is increased. For the case of independent training examples 
considered here this choice also gives the minimum total number of examples presented. 

In future work it would be essential to look at the situation where the patterns are 
recycled, leading to a much more efficient use of training examples and the possibility of 
over-training. In this case, the distribution of overlaps between teacher and pupil would 
not be sufficient to describe the population, since the training energy would then be 
dependent on the training set. One would therefore have to include information specific 
to the training set, such as the mean pattern per site within the training set. This might
be treated as a quenched field at each site, although it is not obvious how one could 
best incorporate such a field into the dynamics. 

Another interesting extension of the present study would be to consider multi-layer 
networks, which would present a much richer dynamical behaviour than the single-layer 
perceptron considered here. This would bring the formalism much closer to problems of 
realistic difficulty. In order to describe the population in this case it would be necessary 
to consider the joint distribution of many order parameters within the population. It 
would be interesting to see how the dynamics of the GA compares to gradient methods 
in networks with continuous weights, for which the dynamics of generalization for a class 
of multi-layer architectures have recently been solved analytically in the case of on-line 
learning. In order to generalize in multi-layer networks it is necessary for the search
to break symmetry in weight space and it would be of great interest to understand how 
this might occur in a population of solutions, whether it would occur spontaneously over 
the whole population in analogy to a phase transition or whether components would 
be formed within the population, each exhibiting a different broken symmetry. This 
would again require the accurate characterization of finite population effects, since an 
infinite population might allow the coexistence of all possible broken symmetries, which 
is presumably an unrealizable situation in finite populations. 

Appendix A. The maximum entropy distribution

After bit-simulated crossover the population is assumed to be at maximum entropy with
constraints on the mean overlap and correlation within the population. This is a special
case of the result derived for the paramagnet by Prügel-Bennett and Shapiro [17] and
this discussion follows theirs closely.

Let $W_i$ be the mean weight at site i within the population,

$$W_i = \langle w_i^\alpha\rangle_\alpha = \frac{1}{P}\sum_{\alpha=1}^{P} w_i^\alpha \qquad (A1)$$

To calculate the distribution of this quantity over sites one imposes constraints on the
mean overlap and correlation with Lagrange multipliers x and z,

$$zPK_1 = \frac{z}{N}\sum_{\alpha=1}^{P}\sum_{i=1}^{N} w_i^\alpha = \frac{zP}{N}\sum_{i=1}^{N} W_i \qquad (A2)$$

$$\frac{(xP)^2}{2}\,q = \frac{x^2}{2N}\sum_{\alpha=1}^{P}\sum_{\beta=1}^{P}\sum_{i=1}^{N} w_i^\alpha w_i^\beta = \frac{(xP)^2}{2N}\sum_{i=1}^{N} W_i^2 \qquad (A3)$$

Recall that we have chosen $t_i = 1$ at each site without loss of generality. The correlation
expression is for large P and finite population corrections can be included retrospectively.

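Without constraints, the number of ways of realizing a mean weight $W_i$ at site i (that is,
a fraction $(1+W_i)/2$ of positive weights) is given by a binomial coefficient,

$$\Omega(W_i) = \binom{P}{P(1+W_i)/2} \qquad (A4)$$

So one can define an entropy,

$$S(W_i) = \log[\Omega(W_i)] \simeq -\frac{P}{2}\log\left(\frac{1-W_i^2}{4}\right) + \frac{PW_i}{2}\log\left(\frac{1-W_i}{1+W_i}\right) \qquad (A5)$$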

where Stirling's approximation has been used. One can then define a probability
distribution for the $\{W_i\}$ configuration which decouples at each site,

$$p(\{W_i\}) = \prod_{i=1}^{N} p(W_i) = \prod_{i=1}^{N}\exp\left[S(W_i) + zPW_i + (xPW_i)^2/2\right] \qquad (A6)$$

$$p(W_i) = \int\frac{d\eta_i}{\sqrt{2\pi}}\exp\left(-\frac{\eta_i^2}{2} + P\,G(W_i, \eta_i)\right) \qquad (A7)$$

where

$$G(W_i, \eta_i) = S(W_i)/P + zW_i + x\eta_i W_i \qquad (A8)$$

The maximal value of G with respect to $W_i$ gives the maximum entropy distribution for
$W_i$ at each site. This leads to the expression,

$$W_i = \tanh(z + x\eta_i) \qquad (A9)$$

where $\eta_i$ is drawn from a Gaussian with zero mean and unit variance. The constraints
can be used to obtain values for the Lagrange multipliers,

$$K_1 = \frac{1}{N}\sum_{i=1}^{N}\overline{\tanh(z + x\eta_i)} \qquad (A10)$$

$$q = \frac{1}{N}\sum_{i=1}^{N}\overline{\tanh^2(z + x\eta_i)} \qquad (A11)$$

The bars denote averages over the Gaussian noise which in general must be done
numerically.

The third and fourth order terms in equations (18b) and (18c) can be found once
the Lagrange multipliers have been determined,

$$\frac{1}{N}\sum_{i=1}^{N}\langle w_i^\alpha\rangle_\alpha^3 = \overline{\tanh^3(z + x\eta)} \qquad (A12)$$

$$\frac{1}{N}\sum_{i=1}^{N}\langle w_i^\alpha\rangle_\alpha^4 = \overline{\tanh^4(z + x\eta)} \qquad (A13)$$

Again, the bars denote averages over the Gaussian noise.
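
The Gaussian averages of powers of $\tanh(z + x\eta)$ that appear in this appendix are straightforward to evaluate numerically. The Python sketch below is one possible way of doing so, using a plain trapezoidal rule on a truncated interval; the integration range, step count and function names are choices made for this illustration rather than anything prescribed by the paper. In practice one would wrap such a routine in a root finder to determine z and x from the known mean overlap and correlation.

```python
import math


def gaussian_average(f, half_width=8.0, steps=4000):
    """Average of f(eta) over a unit Gaussian, by the trapezoidal rule
    on the truncated interval [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps + 1):
        eta = -half_width + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * f(eta) * math.exp(-0.5 * eta * eta)
    return total * h / math.sqrt(2.0 * math.pi)


def tanh_moments(z, x, orders=(1, 2, 3, 4)):
    """Gaussian averages of tanh(z + x*eta)**n for each n in orders."""
    return {
        n: gaussian_average(lambda eta, n=n: math.tanh(z + x * eta) ** n)
        for n in orders
    }


# Illustrative multipliers only: n = 1 and n = 2 give the mean overlap and
# the correlation, n = 3 and n = 4 the higher-order averages.
moments = tanh_moments(z=0.3, x=0.2)
print(moments)
```

For x = 0 the averages reduce to powers of tanh(z), and for z = 0 the odd averages vanish by symmetry, which gives two simple sanity checks on the quadrature.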

Appendix B. The distribution of correlations 

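Rewriting equation (35) we have,

$$q_\infty = \int dq_{\alpha\beta}\, dR_\alpha\, dR_\beta\; p_s(R_\alpha)\, p_s(R_\beta)\, p(q_{\alpha\beta}|R_\alpha, R_\beta)\, q_{\alpha\beta}$$

$$\phantom{q_\infty} = \lim_{t\to 0}\frac{\partial}{\partial t}\log\left(\int dR_\alpha\, dR_\beta\; p_s(R_\alpha)\, p_s(R_\beta)\, p(-it|R_\alpha, R_\beta)\right) \qquad (B14)$$

where $p(-it|R_\alpha, R_\beta)$ is the Fourier transform of $p(q_{\alpha\beta}|R_\alpha, R_\beta)$,

$$p(-it|R_\alpha, R_\beta) = \int dq_{\alpha\beta}\; p(q_{\alpha\beta}|R_\alpha, R_\beta)\, e^{t q_{\alpha\beta}} \qquad (B15)$$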

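The conditional probability for correlations $p(q_{\alpha\beta}|R_\alpha, R_\beta)$ can be defined if weights are
assumed to come from the maximum entropy distribution defined in Appendix A. In
this case one has,

$$p(q_{\alpha\beta}|R_\alpha, R_\beta) = \frac{p(q_{\alpha\beta}, R_\alpha, R_\beta)}{p(R_\alpha, R_\beta)}, \qquad p(q_{\alpha\beta}, R_\alpha, R_\beta) = \left\langle \delta\Big(q_{\alpha\beta} - \frac{1}{N}\sum_i w_i^\alpha w_i^\beta\Big)\,\delta\Big(R_\alpha - \frac{1}{N}\sum_i w_i^\alpha\Big)\,\delta\Big(R_\beta - \frac{1}{N}\sum_i w_i^\beta\Big)\right\rangle \qquad (B16)$$

where the angled brackets denote averages over $w_i^\alpha$ and $w_i^\beta$. The weights at each site
are distributed according to,

$$p(w_i) = \left(\frac{1+W_i}{2}\right)\delta(w_i - 1) + \left(\frac{1-W_i}{2}\right)\delta(w_i + 1) \qquad (B17)$$

Here, $W_i$ is the mean weight per site, defined in equation (A9).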

We consider the Fourier transform of $p(q_{\alpha\beta}|R_\alpha, R_\beta)$ since this appears in the
appropriate generating function, 

Writing the delta functions as integrals and noting that one of the integrals is removed
by the Fourier transform, one finds (ignoring multiplicative constants),

$$p(-it, R_\alpha, R_\beta) = \left\langle\int_{-i\infty}^{i\infty} dy_\alpha\, dy_\beta\,\exp(F)\right\rangle_{\{w_i^\alpha, w_i^\beta\}} \qquad (B19)$$

$$F = -y_\alpha R_\alpha - y_\beta R_\beta + \frac{1}{N}\sum_{i=1}^{N}\left(y_\alpha w_i^\alpha + y_\beta w_i^\beta + t\, w_i^\alpha w_i^\beta\right)$$


Each site decouples and the average over sites can be taken by integrating over the
weight distribution defined in equation (B17). The resulting integral can be computed
for large N by the saddle point method since the exponent can be made extensive by
appropriate rescaling. Eventually one finds (ignoring multiplicative constants),

$$p(-it, R_\alpha, R_\beta) = \exp(-y_\alpha R_\alpha - y_\beta R_\beta + G) \qquad (B20)$$

$$G = \frac{1}{N}\sum_{i=1}^{N}\log\left[(1+W_i)^2 e^{t+y_\alpha+y_\beta} + 2(1-W_i^2)\, e^{-t}\cosh(y_\alpha - y_\beta) + (1-W_i)^2 e^{t-y_\alpha-y_\beta}\right]$$

The saddle point equations fix $y_\alpha$ and $y_\beta$ as implicit functions of $R_\alpha$, $R_\beta$ and t,

$$R_\alpha = \frac{\partial G}{\partial y_\alpha}, \qquad R_\beta = \frac{\partial G}{\partial y_\beta} \qquad (B21)$$

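Define $p(-it)$, whose logarithm is the generating function for $q_\infty$,

$$p(-it) = \int dR_\alpha\, dR_\beta\; p_s(R_\alpha)\, p_s(R_\beta)\, p(-it|R_\alpha, R_\beta)$$

$$\phantom{p(-it)} = \int dR_\alpha\, dR_\beta\; p_s(R_\alpha)\, p_s(R_\beta)\exp[G(t) - G(0)] \qquad (B22)$$

We express the overlap distributions by their Fourier transformed cumulant expansions,

$$p_s(R_\alpha) = \int_{-i\infty}^{i\infty} da\,\exp\left(\sum_n \frac{a^n}{n!}K_n^s - aR_\alpha\right) \qquad (B23)$$

$$p_s(R_\beta) = \int_{-i\infty}^{i\infty} db\,\exp\left(\sum_n \frac{b^n}{n!}K_n^s - bR_\beta\right) \qquad (B24)$$

Now $p(-it)$ is an integral over a, b, $R_\alpha$ and $R_\beta$ which can again be computed by the
saddle point method. One finds that as $t \to 0$, the saddle point equations are satisfied
by,

$$y_\alpha = y_\beta = y \qquad (B25)$$

$$R_\alpha = R_\beta = K_1^s \qquad (B26)$$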

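These are related through an implicit function for y in terms of mean overlap after
selection,

$$K_1^s = \frac{1}{N}\sum_{i=1}^{N}\frac{W_i + \tanh(y)}{1 + W_i\tanh(y)}$$

Then the natural increase contribution for the correlation after selection is given by,

$$q_\infty = \lim_{t\to 0}\frac{\partial}{\partial t}\log p(-it) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{W_i + \tanh(y)}{1 + W_i\tanh(y)}\right)^2$$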
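
The final step of this appendix can be checked numerically by differentiating the site-averaged logarithm G directly. The Python sketch below assumes the form $G = \frac{1}{N}\sum_i \log[(1+W_i)^2 e^{t+y_\alpha+y_\beta} + 2(1-W_i^2)e^{-t}\cosh(y_\alpha-y_\beta) + (1-W_i)^2 e^{t-y_\alpha-y_\beta}]$ and uses central finite differences at the symmetric saddle point $y_\alpha = y_\beta = y$, $t = 0$. The list of mean weights, the value of y and all function names are arbitrary illustrative choices.

```python
import math


def big_g(t, ya, yb, ws):
    """Site average of the log of the two-replica weight sum (constants dropped)."""
    total = 0.0
    for w in ws:
        total += math.log(
            (1 + w) ** 2 * math.exp(t + ya + yb)
            + 2 * (1 - w * w) * math.exp(-t) * math.cosh(ya - yb)
            + (1 - w) ** 2 * math.exp(t - ya - yb)
        )
    return total / len(ws)


def central_diff(f, h=1e-5):
    """Symmetric finite-difference estimate of f'(0)."""
    return (f(h) - f(-h)) / (2 * h)


def shifted(w, y):
    """(W + tanh y) / (1 + W tanh y), the mean weight after selection."""
    t = math.tanh(y)
    return (w + t) / (1 + w * t)


ws = [0.1, -0.3, 0.5, 0.7, 0.0]   # hypothetical mean weights per site
y = 0.4                            # hypothetical saddle point value

dg_dt = central_diff(lambda t: big_g(t, y, y, ws))
dg_dya = central_diff(lambda d: big_g(0.0, y + d, y, ws))
q_closed = sum(shifted(w, y) ** 2 for w in ws) / len(ws)
k1_closed = sum(shifted(w, y) for w in ws) / len(ws)
print(dg_dt, q_closed, dg_dya, k1_closed)
```

The derivative with respect to t reproduces the closed form for the correlation and the derivative with respect to $y_\alpha$ reproduces the mean overlap after selection, as the saddle point equations require.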